How to write SPARQL queries against Freebase data

:BaseKB unleashed

Paul Houle

Creator of database animals and bayesian brains

July 17, 2014

Overview

Freebase RDF data is clean and well-organized, so it can be straightforward to write queries if you understand how. Although a "cookbook" on the subject doesn't yet exist, this post describes the minimum you need to know to write SPARQL queries against Freebase data.

What to load

Although it is possible to load Freebase data directly into a triple store, it is a difficult process because the Freebase RDF dump is not entirely compatible with RDF standards -- many tools will crash or otherwise fail to load the data. The Freebase RDF dump also contains hundreds of millions of redundant or uninteresting triples that greatly increase both loading and query times.

We use the Open Source Infovore framework to produce :BaseKB, a purified data product which is compatible with RDF standard tools.

We've heard reports of people loading :BaseKB Gold into a number of triple stores, including AllegroGraph, Bigdata, and OpenLink Virtuoso. This product is a free download via BitTorrent. The ideal hardware for loading this data is a quad-core machine with at least 32GB of RAM and SSD storage.

If you'd like to skip the loading step, which can take hours, you can use the RDFeasy Compact Edition in the AWS Marketplace, which combines OpenLink Virtuoso, :BaseKB data and perfectly matched hardware for a low hourly price. This is an excellent option for evaluation, research, and development, because (1) you can get started in ten minutes and (2) you only need to pay for the time when you're using it.

Prefix declaration

:BaseKB rewrites URIs from the http://rdf.freebase.com/ns/ namespace to http://rdf.basekb.com/ns/. Since nearly all of the entities, types, and predicates you'll use live in this namespace, we write

prefix : <http://rdf.basekb.com/ns/>

at the beginning of all queries. If you're using raw data from Freebase, you can write

prefix : <http://rdf.freebase.com/ns/>

and get similar results.
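
The relationship between the two namespaces is a plain prefix swap. As a minimal Python sketch (the helper name `to_basekb` is hypothetical, and this only illustrates the mapping, not how Infovore actually performs it):

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"
BASEKB_NS = "http://rdf.basekb.com/ns/"


def to_basekb(uri: str) -> str:
    """Map a Freebase RDF URI to the :BaseKB namespace (hypothetical helper).

    URIs outside the Freebase namespace are returned unchanged.
    """
    if uri.startswith(FREEBASE_NS):
        # Keep the local part (e.g. 'm.06bnz') and swap only the namespace.
        return BASEKB_NS + uri[len(FREEBASE_NS):]
    return uri
```

For example, `to_basekb("http://rdf.freebase.com/ns/m.06bnz")` gives the same local name under the :BaseKB namespace.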

Looking up entities and predicates

Let's try a query I was asked about, which is to find the longest river entirely contained in Russia.

In a perfect world we'd have a :BaseKB-powered schema browser, but for now, we can use the Freebase web interface. Go to

http://freebase.com/

and type the word Russia into the autosuggest at the top. You'll see something like

[Screenshot: Russia dropdown]

If you click on the first link, you'll get to the country page for Russia, which is

https://www.freebase.com/m/06bnz

and if you look at the head of the page you will see the mid (machine identifier)

[Screenshot: Russia detail page]

You can read the mid /m/06bnz either from the header at the top of the page or from the page's URL. Either way, to use it as an RDF identifier, replace the first slash with a colon and the second slash with a period to get

:m.06bnz
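
The slash-to-colon-and-period rule is mechanical, so it can be scripted. Here is a minimal Python sketch; the function names are made up for illustration, and applying the same rule to longer ids such as /location/location/containedby is our assumption, based on how the property names in the query later in this post are written.

```python
from urllib.parse import urlparse


def freebase_id_to_qname(freebase_id: str) -> str:
    """Convert an id like '/m/06bnz' to a prefixed name like ':m.06bnz'.

    Hypothetical helper: the leading slash becomes the ':' prefix and
    every remaining slash becomes a period.
    """
    path = freebase_id.strip("/")
    if not path:
        raise ValueError(f"not a Freebase id: {freebase_id!r}")
    return ":" + path.replace("/", ".")


def page_url_to_qname(url: str) -> str:
    """Read the id from a Freebase page URL, then convert it."""
    return freebase_id_to_qname(urlparse(url).path)
```

Both `freebase_id_to_qname("/m/06bnz")` and `page_url_to_qname("https://www.freebase.com/m/06bnz")` return `:m.06bnz`.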

Now we also need to find two properties to write this query:

We need a property that states that a location is completely contained in another location, and
We need a property to find the length of a river.

We can start at the Freebase home page, which lists "bases" that contain common types and properties.

Putting it all together

Now that we know the properties we need, we can write the following query:

prefix : <http://rdf.basekb.com/ns/>

select ?river ?length {
   ?river :geography.river.length ?length .
   ?river :location.location.containedby :m.06bnz .
} ORDER BY DESC(?length) LIMIT 1
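
The only Russia-specific part of this query is the identifier :m.06bnz, so the same text can be reused for other countries. A minimal Python sketch of that idea follows; the function name and the validation pattern are our own assumptions (the pattern is a conservative guess at what a mid looks like), and the sketch only builds the query text without sending it anywhere.

```python
import re

# The query from this post, with the country left as a placeholder.
# Literal braces are doubled because str.format treats single braces specially.
QUERY_TEMPLATE = """\
prefix : <http://rdf.basekb.com/ns/>

select ?river ?length {{
   ?river :geography.river.length ?length .
   ?river :location.location.containedby {country} .
}} ORDER BY DESC(?length) LIMIT 1
"""


def longest_river_query(country: str) -> str:
    """Return the query text for a country given as a name like ':m.06bnz'."""
    # Assumed shape of a mid-based name; rejects anything else so that
    # arbitrary text cannot be spliced into the query.
    if not re.fullmatch(r":m\.[0-9a-z_]+", country):
        raise ValueError(f"expected a name like ':m.06bnz', got {country!r}")
    return QUERY_TEMPLATE.format(country=country)
```

Calling `longest_river_query(":m.06bnz")` reproduces the query above.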

If you're using RDFeasy, you can run this in the "Database/Interactive SQL" tab by putting the command 'sparql' in front of the SPARQL, which looks like

[Screenshot: query snapshot]

and then you get this result

[Screenshot: query result]

If we convert that mid back to a Freebase detail page we get

https://www.freebase.com/m/0203mm

which is the right answer.
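
Going from a result back to a Freebase page is the earlier conversion in reverse. A minimal Python sketch, assuming the result comes back as a full URI in the :BaseKB namespace (the function name is hypothetical):

```python
BASEKB_NS = "http://rdf.basekb.com/ns/"
FREEBASE_SITE = "https://www.freebase.com/"


def basekb_uri_to_page(uri: str) -> str:
    """Turn a :BaseKB URI for a mid into the matching Freebase page URL."""
    if not uri.startswith(BASEKB_NS):
        raise ValueError(f"not a :BaseKB URI: {uri!r}")
    local = uri[len(BASEKB_NS):]  # e.g. 'm.0203mm'
    # Periods in the local name go back to slashes in the page path.
    return FREEBASE_SITE + local.replace(".", "/")
```

With the mid from the result above, `basekb_uri_to_page("http://rdf.basekb.com/ns/m.0203mm")` returns `https://www.freebase.com/m/0203mm`.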

Thinking in RDF

Note that we don't need to put

    ?river a :geography.river .

into the query because only a :geography.river can be the subject of :geography.river.length. This isn't just because Freebase types are organized like base -> type -> property, but because RDFS can infer the above statement based on

:geography.river.length rdfs:domain :geography.river .
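
To make the inference concrete, here is a toy Python sketch of the RDFS domain rule: if a property has a domain, any subject that uses the property is inferred to have that type. It works on plain tuples rather than a real triple store, it uses the length property from the query above, and the length value is a placeholder rather than real data.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_types(triples):
    """Apply the rule: (p rdfs:domain C) and (s p o) imply (s rdf:type C)."""
    # Collect the declared domain(s) of each property.
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    # Every subject that uses such a property gets the domain as its type.
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


triples = {
    (":geography.river.length", RDFS_DOMAIN, ":geography.river"),
    (":m.0203mm", ":geography.river.length", "<length placeholder>"),
}
```

Here `infer_types(triples)` yields the single triple stating that :m.0203mm is a :geography.river, which is exactly the constraint we left out of the query.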

Just as computer programs (particularly in Java) can grow verbose, so can SPARQL queries, and it's wise to leave out any unnecessary constraints.